ABSTRACT

RepeatsDB (http://repeatsdb.bio.unipd.it/) is a 
database of annotated tandem repeat protein struc- 
tures. Tandem repeats pose a difficult problem for 
the analysis of protein structures, as the underlying 
sequence can be highly degenerate. Several repeat 
types have been studied over the years, but their
annotation was done on a case-by-case basis, thus
making large-scale analysis difficult. We developed 
RepeatsDB to fill this gap. Using state-of-the-art 
repeat detection methods and manual curation, we 
systematically annotated the Protein Data Bank, 
predicting 10745 repeat structures. In all, 2797 
structures were classified according to a recently 
proposed classification schema, which was 
expanded to accommodate new findings. In 
addition, detailed annotations were performed in a 
subset of 321 proteins. These annotations feature 
information on start and end positions for the 
repeat regions and units. RepeatsDB is an ongoing 
effort to systematically classify and annotate struc- 
tural protein repeats in a consistent way. It provides 
users with the possibility to access and download 
high-quality datasets either interactively or pro- 
grammatically through web services. 

INTRODUCTION 

A large portion of proteins contain repetitive motifs, 
which are generated by internal duplications and fre- 
quently correspond to structural and functional units of 
proteins. Many repetitions in protein sequences can be 
identified by using different approaches (1-4). A more
difficult problem for identification is however posed by
repeats in protein structure, which can be highly degener- 
ate (5,6). In fact, it is possible for a protein to maintain a 
repetitive structure even in the presence of massive 
amounts of point mutations (7). Several repeat families 
have been studied so far due to their relevance in different 
biological processes such as health (8), neurodevelopment 
(9) and protein engineering (10-12), to name just a few. 

Repeats have been previously divided into five broad 
classes, primarily as a function of repeat length (13,14). 
At the lower end of the repeat length spectrum, i.e. less 
than five residues, very short repeats can either form in- 
soluble aggregates (crystallites, class I) or long and 
winding helices of fibrous structures like collagen and
α-helical coiled-coils (class II). At the other end of the
spectrum, repeats containing >~50 residues appear to 
fold mostly as domains forming beads-on-a-string structures
(class V). In between, for unit lengths of 5-40
residues, the known repeats can form either open 
elongated solenoids (class III) or closed toroids (class 
IV). Due to their fundamental functional importance, 
classes III and IV contain the most studied types of 
tandem repeat proteins. Solenoid folds appear to follow 
the distribution of repeat lengths rather closely, from all- 
beta (e.g. anti-freeze proteins) (15) to mixed alpha/beta 
(e.g. leucine-rich repeats) (16,17) to all-alpha structures 
(e.g. Armadillo and HEAT repeats) (18-20). They are 
characterized by some of the largest known autonomously 
folding domains, with 500 or more residues forming a 
single structure (21). Rapid addition or deletion of 
repeat units even between close homologs is of particular 
note for solenoid structures (22). Toroids on the other 
hand are restricted in overall size by their closed circular 
nature. Known toroid structures include the highly versa- 
tile TIM barrel and large outer membrane beta-barrels 
(23). Perhaps a more interesting fold is the beta propeller 
(e.g. WD repeats), which can accommodate variable
numbers of repeat units while maintaining a closed 
circular structure (24,25). 

An open question regarding repeat proteins is the exist- 
ence of other common structures that may have gone un- 
detected. After all, the most common way to detect repeat 
families so far was to manually annotate the sequence 
family first and only afterwards visually recognize their 
structural repetitiveness. Such an approach is obviously 
difficult when dealing with the entire Protein Data 
Bank (PDB) (26), especially considering the many 
uncharacterized protein structures deposited by the main 
structural genomics consortia (27). The systematic de- 
scription of repeat structures becomes a question of 
using automated methods to detect them in protein structures.
This field is relatively new, with only a few available
methods. One of the first attempts was made by the 
Thornton group (28), but is unfortunately no longer avail- 
able. Some methods (4,29-33) were developed to detect 
internal symmetries in proteins, but these may be difficult 
to adapt to the systematic classification of repeats. 
Recently, our group has developed RAPHAEL (34) in 
an attempt to fill the gap for repeat detection from struc- 
ture. Widely used structural classifications such as CATH 
(35) and SCOP (36) also do not explicitly annotate repeats 
in protein structures, although it may be possible to 
leverage individual annotations to find similar repeats. 
Some databases exist for the detection of repeats from 
sequence (37-39), but usually these are limited to short 
tandem repeats and do not take into account divergent 
repeats, such as solenoids or toroids. The main domain 
sequence databases such as Pfam (40) and SMART (41) 
do not excel at the annotation of these repeat types either, 
as coverage is rather low and many repeat units go undetected.
For Pfam, most of the largest clusters of human
sequence regions not covered were recently found to be 
repeats (42). To the best of our knowledge, no database or 
classification is currently available for repeat structures. 
This is the motivation for our present work, and we intro- 
duce RepeatsDB as a way to fill this gap. The database 
was developed to provide a central resource for the systematic
annotation and classification of repeats. Because
structure-based search and classification of
repeat proteins is more complete than search on the basis
of sequences or keywords, our database will allow more
accurate assignment of proteins with repeats to the cor- 
responding families. For example, it will be used to 
suggest a better subdivision of alpha-solenoid proteins 
where at present the boundaries between the structures 
with Armadillo, HEAT, TPR and other repeat types are 
frequently blurred. 

DATABASE DESCRIPTION 

Data curation 

The initial dataset for RepeatsDB was extracted from the 
PDB (43). Repeat candidates were identified from the 
reduced PDB dataset with RAPHAEL (34), which uses 
a geometric approach imitating the work of a human 
curator (score cutoff >1). The resulting dataset consisted
of >10 000 repeat candidates, stored in the database as
'predicted' entries, which underwent a classification and 
curation process. 

The dataset of predicted repeats was manually curated 
using a two-level annotation system. The first manual an- 
notation level ('manually classified') classifies an entry into 
structural repeat class and subclass. This classification is 
based on previous work (14), where five classes of repeat 
structures are proposed, which are then further divided 
into subclasses. Class assignment is based mainly on 
repeat unit length and subclass assignment on secondary 
and tertiary structure features. The second manual annotation
level ('detailed') consists of providing information
about the start and end positions of the repeat units, 
repeat regions and/or insertions. We define a repeat unit 
as the smallest structural building block that is repeated to 
form a repeat region. A repeat region is a group of at least 
three repeat units. Inclusion of proteins with two repeat 
units would significantly complicate classification because 
many typical globular domains have this type of architec- 
ture. Insertions are non-repeated segments of structure 
that occur either inside a repeat unit or between two of 
them. These are particularly interesting because they 
break the repeat symmetry, and represent a challenge 
both for automatic detection and for the analysis of 
repeat structures (34). 
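
The definitions above can be sketched as a small data model. This is only an illustration under stated assumptions, not the RepeatsDB schema: the names `Segment`, `RepeatRegion`, `units` and `insertions` are hypothetical, and the only rule taken from the text is that a repeat region needs at least three repeat units.

```python
from dataclasses import dataclass, field
from typing import List

MIN_UNITS_PER_REGION = 3  # from the text: a region is a group of at least three units


@dataclass(frozen=True)
class Segment:
    """A stretch of residues, inclusive on both ends (hypothetical representation)."""
    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError(f"start {self.start} is after end {self.end}")

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class RepeatRegion:
    """A repeat region built from repeat units, with optional insertions."""
    units: List[Segment]
    insertions: List[Segment] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.units) < MIN_UNITS_PER_REGION:
            raise ValueError("a repeat region needs at least three repeat units")

    @property
    def start(self) -> int:
        return min(unit.start for unit in self.units)

    @property
    def end(self) -> int:
        return max(unit.end for unit in self.units)

    def average_unit_length(self) -> float:
        return sum(unit.length for unit in self.units) / len(self.units)


# Illustrative coordinates only; they are not taken from any RepeatsDB entry.
region = RepeatRegion(units=[Segment(10, 42), Segment(43, 75), Segment(76, 109)])
print(region.start, region.end, round(region.average_unit_length(), 1))
```

Rejecting regions with fewer than three units at construction time mirrors the stated reason for the rule: two-unit architectures are common in ordinary globular domains and would blur the classification.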

Several curators annotated each protein undergoing 
manual classification by consensus. For first-level annota- 
tions, at least 75% of the curators had to agree in order 
for a protein to be included; otherwise it would be
excluded and placed on a reserve list for future annota- 
tion. The rationale for this choice is that ambiguous cases 
are generally difficult to classify but may occasionally rep- 
resent a novel repeat class. For second-level annotations, 
the threshold for consensus was at least 65% agreement 
(typically two of three curators). In case of discrepancy, an 
expert would arbitrate the final annotation based on the 
alternative proposals. Proteins with detailed annotations 
were also used to search for similar sequences in proteins 
from the PDB. Any PDB chain with at least 40% sequence 
identity and a coverage of at least 80% of the classified 
protein, belonging to the initial list of predicted entries, is 
added to the 'classified by similarity' annotation level. The 
similarity thresholds were selected to exclude possible 
false-positives (data not shown). 
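
The curation thresholds described above can be written down as two small checks. This is a minimal sketch, assuming that each curator's proposal is reduced to a single comparable value; the function names are hypothetical, and only the four numeric thresholds come from the text.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

FIRST_LEVEL_THRESHOLD = 0.75   # class/subclass agreement (from the text)
SECOND_LEVEL_THRESHOLD = 0.65  # detailed annotation agreement (from the text)
MIN_IDENTITY = 0.40            # sequence identity (from the text)
MIN_COVERAGE = 0.80            # coverage of the classified protein (from the text)


def consensus(votes: Sequence[Hashable], threshold: float) -> Optional[Hashable]:
    """Return the most common proposal if its share reaches the threshold.

    None means no consensus: the entry goes to the reserve list (first level)
    or to an expert for arbitration (second level).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None


def classified_by_similarity(identity: float, coverage: float, is_predicted: bool) -> bool:
    """Check whether a PDB chain inherits the classification of an annotated protein."""
    return is_predicted and identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE


# Two of three curators agree: enough for the second level, not for the first.
votes = ["III.3", "III.3", "IV.1"]
print(consensus(votes, FIRST_LEVEL_THRESHOLD))
print(consensus(votes, SECOND_LEVEL_THRESHOLD))
```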

Implementation 

RepeatsDB was designed with a multi-tier architecture, 
using separate modules for data management, data processing
and presentation functions. To simplify development
and maintenance, all tiers handle the common JSON
(JavaScript Object Notation) format, thereby eliminating 
the need for data conversion. The MongoDB database 
engine is used for data storage and Node.js as middleware 
between data and presentation. RepeatsDB exposes its re- 
sources through RESTful web services, by using the 
Restify library for Node.js. The Angular.js framework 
and Bootstrap library were selected to provide the 
overall look-and-feel. Angular.js to Bootstrap integration 
is available through the angular-ui project. A customized 
version of the BioJS (44) sequence component is used as
sequence visualizer. Additional information is added to 
entries by querying the PDB web services at the structure 
and chain level. At the structure level, annotations like 
organism and experimental method used when resolving 
the structure are provided. At the chain level, secondary
structure and links to other databases are provided, among others.
RepeatsDB offers users both graphical web interface
access and RESTful web services at http://repeatsdb.bio.unipd.it/

USING REPEATSDB 

The user interface presents an intuitive tree-based 
browsing mechanism, where the root of the tree is the 
full database, second-level nodes are repeat classes and
third-level nodes are subclasses. When clicking on a node,
the user is presented with the list of RepeatsDB entries 
corresponding to the selected category. Each row of the 
list shows basic information about the entry, like its entry 
ID, title and organism. All annotated chains correspond- 
ing to an entry are displayed in a single page. The user 
interface presents a structure and sequence visualization 
widget (Figure 1). The user may choose to visualize the 
structure in four static images, or by using the 3D visual- 
izer. If the entry features detailed annotations, the repeat 
regions, units and/or insertions are displayed using a com- 
bination of colours. The sequence visualization widget 
displays the sequence and secondary structure correspond- 
ing to the structure. It displays the same colour coding as 
the structure visualization widget, associating repeat annotations
in the structure and sequence views. Additional
information at the structure and chain levels is also
provided.

Figure 1. Screenshot of a sample RepeatsDB entry results page (PDB entry 1ikn). The sequence viewer and the structure viewer are shown in the
middle of the page, towards the left and the right, respectively. Additional annotations at the structure and chain level are displayed, including links
to other databases (above) and classifications (below).
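
The three-level browsing tree described above (database, class, subclass) can be derived directly from subclass labels such as "III.3". The sketch below is an assumption about one way to group entries, not the site's implementation; the function name is hypothetical, and apart from 1iknD (the α-solenoid example entry) the entry IDs are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_browse_tree(entries: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group (entry_id, subclass) pairs into class -> subclass -> entry IDs."""
    tree: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for entry_id, subclass in entries:
        repeat_class = subclass.split(".", 1)[0]  # "III.3" -> "III"
        tree[repeat_class][subclass].append(entry_id)
    # Convert to plain dicts so that looking up a missing key raises KeyError
    # instead of silently creating an empty node.
    return {cls: dict(subclasses) for cls, subclasses in tree.items()}


# "hypA" and "hypB" are placeholder IDs used only for illustration.
entries = [("1iknD", "III.3"), ("hypA", "III.3"), ("hypB", "IV.4")]
tree = build_browse_tree(entries)
print(tree["III"]["III.3"])
```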

The RepeatsDB search toolbar, available on top of 
every page, allows searching for entries either by
database IDs or UniProt text query. The database ID 
search allows comma-separated PDB or UniProt IDs. 
The UniProt text search query uses the full UniProt 
search engine (see the online documentation). RESTful web
services are directly accessible through HTTP URLs. All 
data available on RepeatsDB are also available for pro- 
grammatic access. Please refer to the 'Help' section of the 
website for details on using the RepeatsDB web services. 
Datasets can be downloaded in JSON, XML or text 
format using the browse function or RESTful web 
services. 
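
Programmatic access of the kind described above could look roughly like the following sketch. It deliberately does not assume any endpoint path: the service URL has to be taken from the 'Help' section of the website. The function names, the `opener` parameter (added so the code can be exercised without a network connection) and the field names in the usage comment are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, Sequence
from urllib.request import urlopen


def fetch_json(url: str, opener: Callable[..., Any] = urlopen, timeout: float = 30.0) -> Any:
    """Download `url` and decode the response body as JSON."""
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_tab(records: Sequence[Dict[str, Any]], columns: Sequence[str]) -> str:
    """Flatten a list of JSON objects into tab-separated text with a header row."""
    lines = ["\t".join(columns)]
    for record in records:
        # Missing fields become empty cells rather than raising an error.
        lines.append("\t".join(str(record.get(column, "")) for column in columns))
    return "\n".join(lines)


# Usage (not executed here; take the real service URL from the 'Help' section):
#   records = fetch_json(service_url)
#   print(to_tab(records, ["id", "subclass"]))  # field names are hypothetical
```

Because every tier of the database already exchanges JSON, a client only needs a JSON parser; the tab-separated conversion is shown merely as one convenient local format.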

Statistics 

Analysis of the full PDB dataset yielded 10 745 repeats 
predicted by RAPHAEL, of which 2797 were finally clas- 
sified into the RepeatsDB schema. Table 1 shows the dis- 
tribution between classes and subclasses. The bulk of the 
annotations (~90%) consist of entries belonging to classes 
III and IV. No effort was made to balance the distribution 
of entries between classes in this initial release. As 
coverage increases in the future, we expect the balance 
to approximate the real distribution more closely, 
although it may be necessary to fine-tune RAPHAEL. 
Of the classified entries, 321 representatives of the entire 
dataset were annotated in detail with information about 
the start and end of repeat regions, repeat units and/or 
insertions (Table 1). It is interesting to note the different 
distribution of insertions between classes. Apparently, 
some classes such as β-solenoid (class III.1) or TIM
barrels (class IV.1) have a stronger propensity to accommodate
insertions.
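
The headline figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the totals quoted in the text and in the 'Total' row of Table 1; the variable names are ours.

```python
# Totals as reported in the text and in the 'Total' row of Table 1.
predicted_total = 10745
detailed, manually_classified, classified_by_similarity = 321, 749, 1727

# Entries on any of the three classified annotation levels.
classified = detailed + manually_classified + classified_by_similarity
# Predicted entries that were not assigned to a class.
unassigned = predicted_total - classified
fraction_classified = classified / predicted_total

print(f"classified: {classified}")
print(f"unassigned: {unassigned}")
print(f"classified share of predictions: {fraction_classified:.1%}")
```

Under these totals, roughly a quarter of the RAPHAEL predictions carry a class assignment, and the remainder corresponds to the 'Unassigned' row.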



CONCLUSIONS AND FUTURE WORK 

RepeatsDB's goal is to provide the community with a 
resource for high-quality tandem repeat protein structure 
annotations. The user can either interactively analyse
proteins of interest via the user interface, or create and 
download datasets for offline use. Far from being a 
static classification process, the annotation effort for the 
initial RepeatsDB dataset alone already motivated the ex- 
tension of the original classification schema (14). Some of 
the curated structures, while clearly representing structural 
repeats, did not belong to any of the pre-defined 
subclasses. To allow them to be classified, subclasses 
IV.5 (α/β prism) and IV.6 (α-barrel) were added to the
initial schema (14). Class V also underwent a re-classifica- 
tion according to the secondary structure content of the 
single domain repeats ('beads') to allow a broader classi- 
fication range beyond individual repeat families, as the list 
of possible beads-on-a-string folds may be considerably 
larger than currently appreciated. The 'other' subclass 
was also added to allow collection of repeats that do not 
fit into the current classification scheme. RepeatsDB 
provides the community with a previously unavailable op- 
portunity to easily create datasets of tandem repeat 
proteins. The detailed annotation subset further presents 
a unique opportunity to better understand the nature of 
tandem repeat proteins. 

Beyond its initial release, RepeatsDB is a continuous 
effort to expand, revise and improve tandem protein 
repeat annotations. Predictions for new PDB structures 
are simple and fully automated, allowing regular 
database updates every 3 months. Manual curation of 
new entries for inclusion is also ongoing, aiming at 
regular and steady updates. Options to involve the community
in the annotation process through crowdsourcing
tools are currently being analysed. A main goal
for future versions is the extension of the annotation of
repeats at the sequence level, starting from annotation for
intrinsically disordered regions from MobiDB (45). We
anticipate that RepeatsDB should prove valuable
towards the understanding of the sequence-structure relationship
in tandem repeat proteins and their evolutionary
relationship.

Table 1. Statistics for RepeatsDB

Subclass  Name                          Detailed  Classified       Classified  Predicted
                                                  (manually)  (by similarity)
I.1       Poly-alanine β structure             0           0                0          0
II.1      Collagen triple-helix                0           5                0          0
II.2      α-helical coiled coil               23          38               69          0
III.1     β-solenoid                          43         113               21          0
III.2     α/β solenoid                        21          43               27          0
III.3     α-solenoid                          48         246              631          0
III.4     Trimer of β spirals                  7           0               13          0
III.5     Single layer anti-parallel β         4           3                0          0
IV.1      TIM-barrel                          84         118              626          0
IV.2      β-barrel                             8           1                8          0
IV.3      β-trefoil                           20           0               29          0
IV.4      β-propeller                         40         182              227          0
IV.5      α/β prism                            0          17                0          0
IV.6      α-barrel                             6           0                0          0
V.1       α-beads                              2           1                0          0
V.2       β-beads                             29          12               71          0
V.3       α/β-beads                            3           3                1          0
V.other   Unknown subclass                     3           0                4          0
UA        Unassigned                           0           0                0       7948
Total                                        321         749             1727       7948

The subclass name is shown together with the number of entries on each of the four annotation levels. Note that 'Unassigned' entries are automatically
predicted by RAPHAEL and therefore not assigned to a specific class.